Information-theoretic analysis of protein sequences shows that amino acids self-cluster.
Authors
Abstract
For each of the 20 amino acids X, we analyse the statistics of spacings between consecutive occurrences of X within the well-characterized Saccharomyces cerevisiae genome. The occurrences of an amino acid may exhibit near-random, clustered or smoothed-out behaviour, like one-dimensional stochastic processes along the protein chain. If amino acids are distributed randomly within a sequence, then they follow a Poisson process, and a histogram of the number of observations of each gap size would asymptotically follow a negative exponential distribution. The novelty of the present approach lies in the use of differential geometric methods to quantify information on the sequencing of amino acids and groups of amino acids, via the sequences of intervals between their occurrences. The differential geometry arises from an information-theoretic distance function on the two-dimensional space of stochastic processes subordinate to gamma distributions, which include the random process as a special case. Maximum-likelihood estimates of the parametric statistics show that all 20 amino acids tend to cluster, some substantially. In other words, the frequencies of short gap lengths tend to be higher, and the variance of the gap lengths greater, than expected by chance. This may be because localizing amino acids with the same properties favours the formation of secondary structure or transmembrane domains. Gap sizes of 1 or 2 are generally disfavoured, 1 strongly so. The only exceptions are Gln and Ser, as a result of poly(Gln) or poly(Ser) sequences. There are preferences for gaps of 4 and 7 that can be attributed to α-helices. In particular, a favoured gap of 7 for Leu is found in coiled coils. Our method contributes to the characterization of whole sequences by extracting and quantifying stable stochastic features.
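The gap statistics described in the abstract can be illustrated with a minimal sketch. It is not the authors' procedure: the abstract reports maximum-likelihood gamma fits, whereas this sketch swaps in a simpler method-of-moments estimate of the gamma shape parameter, and it treats the integer gaps as if they were continuous. The function names `gaps` and `gamma_shape_moments` and the toy sequence are hypothetical, chosen only for illustration.

```python
from statistics import mean, pvariance


def gaps(sequence, residue):
    """Return spacings between consecutive occurrences of `residue`.

    Adjacent residues give a gap of 1, so "LL" yields [1].
    """
    positions = [i for i, aa in enumerate(sequence) if aa == residue]
    return [b - a for a, b in zip(positions, positions[1:])]


def gamma_shape_moments(gap_list):
    """Method-of-moments gamma shape: kappa = mean^2 / variance.

    Assumption: a stand-in for the maximum-likelihood fit in the abstract.
    kappa == 1 corresponds to the exponential (random, Poisson) case,
    kappa < 1 to gaps more variable than random (clustering),
    kappa > 1 to gaps more even than random (smoothed out).
    """
    m = mean(gap_list)
    v = pvariance(gap_list)
    if v == 0:
        raise ValueError("all gaps are equal; the shape is undefined")
    return m * m / v


if __name__ == "__main__":
    toy = "LAALLKQQQQLSTLAAGLLLKEL"  # hypothetical sequence, not real data
    leu_gaps = gaps(toy, "L")
    print("Leu gaps:", leu_gaps)
    print("moment estimate of gamma shape:", gamma_shape_moments(leu_gaps))
```

On real data the same two steps would be applied per amino acid across all protein sequences of a proteome; a sequence this short gives no statistically meaningful estimate.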
Related articles
A graph-based clustering method applied to protein sequences
The number of amino acid sequences is increasing very rapidly in protein databases such as Swiss-Prot, UniProt, PIR and others, but the structures of only some of these sequences are found in the Protein Data Bank. Thus, an important problem in genomics is automatically clustering homologous protein sequences when only sequence information is available. Here, we use graph-theoretic techniques...
A generalization of Profile Hidden Markov Model (PHMM) using one-by-one dependency between sequences
The Profile Hidden Markov Model (PHMM) can be poor at capturing dependency between observations because of the statistical assumptions it makes. To overcome this limitation, the dependency between residues in a multiple sequence alignment (MSA) which is the representative of a PHMM can be combined with the PHMM. Based on the fact that sequences appearing in the final MSA are written based on th...
Phylogenetic and sequence analysis of the growth hormone gene of two sturgeons, Huso huso and Acipenser Gueldenstaedtii
In this study, the growth hormone cDNAs (cGH) of the Beluga sturgeon (Huso huso) and the Russian sturgeon (Acipenser gueldenstaedtii) were cloned and sequenced, and phylogenetic relationships were examined using nucleic acid and amino acid sequences. The nucleotide sequence of the Beluga GH has an open reading frame of 645 nucleotides encoding a protein of 214 amino acid residues. The signal peptide cleav...
An Evolutionary Relationship Between Stearoyl-CoA Desaturase (SCD) Protein Sequences Involved in Fatty Acid Metabolism
Background: Stearoyl-CoA desaturase (SCD) is a key enzyme that converts saturated fatty acids (SFAs) to monounsaturated fatty acids (MUFAs) in fat biosynthesis. Despite being crucial for interpreting SCDs’ roles across species, the evolutionary relationship of SCD proteins across species has yet to be elucidated. This study aims to present this evolutionary relationship based on amino aci...
Effect of Various Levels of Protein in Diet Based on Total and Digestible Amino Acids on Performance of Cobb Strain Broilers
To study the effects of various dietary protein levels, based on total and digestible amino acids, on the performance and carcass traits of broilers, an experiment was conducted as a completely randomized design with 288 broiler chicks of the Cobb 500 strain in 6 treatments, each treatment consisting of 3 replicates with 16 chicks per replicate. The treatments were the following diets: 1) diet for...
Journal:
- Journal of Theoretical Biology
Volume 218, Issue 4
Pages -
Publication date: 2002